Nikon Multimedia Event Detection System
Abstract
This note presents Nikon's approach to the multimedia event detection (MED) task of TRECVID 2010. We explain the basic concept of our system, which detects events by extracting multiple keyframes based on scene length. We describe the algorithm in detail and show experimental results on the MED dataset. Our simple system took third place among the seven participating teams.

1 Basic Concept

We rely on the assumption that a small number of images in a given video contains enough information for event detection. Under this assumption, we reduce the event detection task to a classification problem over a set of images, which we call keyframes, obtained by sampling images that potentially carry the necessary information. Keyframe extraction is based on a scene-cut detection technique and on the assumption that the longer a scene is, the more relevant information it contains. The classification step employs the bag-of-words (BoW) framework [1] based on the SIFT descriptor [2]. The resulting histogram is fed into a support vector machine (SVM), which is trained on the training dataset for each event category. We refer to our system as the Nikon multimedia event detection (MED) system in this note.

2 Detailed Description of the Nikon MED System

The Nikon MED system consists of the following five major steps:

1. Create a space-time (ST) image from a video.
2. Perform scene-cut detection based on the ST image.
3. Extract keyframes from each scene.
4. Construct a BoW histogram from the set of keyframes.
5. Classify the histogram by SVM.

The details of each step are described in the following subsections. Our implementation is built on the OpenCV library [3].

[Fig. 1(a) illustrates ST image creation for an example input video (HVC2356.mp4, frame size 504x284, duration 5m51s, 30 FPS), whose frames are each stacked into a 1D vertical vector to form a 996x1200 ST image with time (frame index) along one axis and space index along the other. Fig. 1(b) illustrates scene-cut detection on the ST image: red marks vertical edges detected by the Canny detector, blue marks frames with more than 1/60 votes, and green marks the scene-cut points remaining after the 2-second interval rejection.]

Fig. 1: Space-time (ST) image creation (a) and scene-cut detection (b). The ST image is created by stacking the 2D pixel array of each frame, sampled at regular intervals of a video, into a long 1D vertical vector.

2.1 Step 1: Space-time Image Creation

Guimarães et al. proposed a scene-cut method called the visual rhythm [4]. It extracts the pixel values on the two diagonal lines of each frame and stacks them into a 1D vector. The obtained 1D vectors are concatenated to form a 2D space-time (ST) image, and scene-cut detection is performed by applying a vertical edge detector to the ST image. We follow their spirit but adopt a more robust variant: we utilize all pixel values. A given video is converted to a large ST image by sampling frames every 0.5 seconds and unfolding the 2D structure of each image into a 1D vector (see Figure 1). Before applying this procedure, we convert the color video to grayscale, trim each frame to a 4:3 aspect ratio, and resize it to 40 × 30 pixels. Thus, the size of an ST image is ⌊duration · FPS · 0.5⌋ × 1200, where ⌊·⌋ denotes the floor of a real number.

[Fig. 2 illustrates four variants of keyframe extraction: (a) Key-(1,1): select the longest scene (M=1), exclude dark frames, and extract the center frame (N=1); (b) Key-(1,N): select the longest scene (M=1), exclude dark frames, and extract N frames on a regular grid (N=3); (c) Key-(M,1): select the M longest scenes (M=3), exclude dark frames, and extract the center frame of each (N=1); (d) Key-(M,N): select the M longest scenes (M=3), exclude dark frames, and extract N frames from each (N=3).]

Fig. 2: Keyframes extraction based on scene-cut.

2.2 Step 2: Scene-cut Detection

Scene-cut detection is performed by finding vertical lines in the ST image.
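As a rough illustration, this vertical-line search can be sketched in NumPy, assuming the Canny edge map of the ST image has already been computed. The function name, the binary `edge_map` input, and the parameter names are our own; the thresholds (1/60 of the pixels in a time frame, 2-second rejection at the 0.5 s row sampling) follow the description in this section, but this is a sketch rather than the original implementation.

```python
import numpy as np

def detect_scene_cuts(edge_map, votes_frac=1/60, min_gap_s=2.0, sample_dt=0.5):
    """Find scene-cut time frames in an ST image.

    edge_map : binary array of shape (T, 1200); 1 where the Canny
               detector fired along the time axis.
    Returns the list of row indices accepted as scene-cut points.
    """
    n_pix = edge_map.shape[1]
    # Hough voting degenerates to counting edges per time frame (row).
    votes = edge_map.sum(axis=1)
    candidates = np.where(votes > n_pix * votes_frac)[0]   # > 1/60 of 1200 = 20 votes
    min_gap = int(round(min_gap_s / sample_dt))            # 2 s -> 4 sampled rows
    cuts, last = [], -min_gap - 1
    for t in candidates:
        # Reject a candidate falling within 2 s after the previous cut.
        if t - last >= min_gap:
            cuts.append(int(t))
            last = t
    return cuts
```

A candidate row therefore needs more than 20 of its 1200 pixels flagged as edges, and successive cuts are forced at least four sampled rows (2 seconds) apart.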
We first extract vertical edges with the Canny detector [5] and then find vertical lines by Hough voting, which in this case is simply counting the detected vertical edges at each time frame. We consider time frames that receive more than 20 votes (1/60 of the 1200 pixels lying in the same time frame of the ST image) to be scene-cut points, neglecting any time frame within 2 seconds after the previous scene-cut point. See Figure 1 for an illustration.

2.3 Step 3: Keyframes Extraction

We denote by Key-(M,N) the keyframes extraction in which N keyframes are extracted from each of the M longest scenes. See Figure 2 for examples. Before sampling N frames from each scene at intervals of l_i/(N + 1), where l_i denotes the length of the i-th scene, we exclude dark frames whose average brightness is less than 80/256. If the number of frames remaining after dark-frame exclusion is less than N, supplementary frames are extracted from shorter scenes.

2.4 Step 4: Bag-of-words (BoW) Histogram Construction

We represent a set of keyframes with a bag-of-words (BoW) histogram based on SIFT descriptors at interest points. Before SIFT descriptor extraction, for which we adopted van de Sande's software [6], we trim each keyframe to a 4:3 aspect ratio and resize it to 320 × 240 pixels. The codebook with 1000 visual words is created by K-means over all SIFT descriptors extracted from all keyframes of the training set, unless the total number of descriptors exceeds n_lim = 2^22 ≈ 4 × 10^6. When more descriptors are found, we randomly choose n_lim of them to respect the memory limitation.
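The Key-(M,N) selection of Step 3 can be sketched as follows. This is a simplified sketch with our own function and argument names: it ranks scenes by length, samples each of the M longest at l_i/(N+1) intervals, and simply skips dark frames instead of implementing the supplementary-frame fallback from shorter scenes described in the text.

```python
import numpy as np

def extract_keyframes(cut_points, n_frames, brightness, M=3, N=3,
                      dark_thresh=80/256):
    """Key-(M,N) keyframe selection.

    cut_points : scene-cut frame indices from Step 2
    n_frames   : total number of frames in the video
    brightness : per-frame average brightness in [0, 1]
    Returns frame indices of the selected keyframes.
    """
    # Turn cut points into (start, end) scene intervals.
    bounds = [0] + list(cut_points) + [n_frames]
    scenes = [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]
    # Keep the M longest scenes.
    scenes.sort(key=lambda se: se[1] - se[0], reverse=True)
    keyframes = []
    for s, e in scenes[:M]:
        li = e - s
        # Sample N frames at l_i/(N+1) intervals within the scene.
        for k in range(1, N + 1):
            f = s + int(k * li / (N + 1))
            if brightness[f] >= dark_thresh:   # exclude dark frames
                keyframes.append(f)
    return keyframes
```

With N=1 this reduces to taking the (non-dark) center frame of each of the M longest scenes, matching the Key-(M,1) case of Figure 2.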
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
Published: 2010